Getting Familiar with our Data!

We first need to load the data set I found from ‘data.world’ at [website link].

  ## Getting my Data Set
data <- read.csv("~/Library/CloudStorage/GoogleDrive-ccoonce@asu.edu/My Drive/DAT 301/ProjectModule4/inform8n-us-national-parks-visitation-1904-2016-with-boundaries/data/all_national_parks_visitation_1904_2016.csv")

First, I would like to understand the structure of this data set.

  ## Exploring my Data Set
head(data, 3)
summary(data)
##   created_by        measure_selector       year            date_edit        
##  Length:21560       Length:21560       Length:21560       Length:21560      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   scrapeurl          gis_notes           gnis_id            geometry        
##  Length:21560       Length:21560       Length:21560       Length:21560      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    metadata         number_of_records    parkname            region         
##  Length:21560       Length:21560       Length:21560       Length:21560      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##     state            unit_code          unit_name          unit_type        
##  Length:21560       Length:21560       Length:21560       Length:21560      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##     visitors           yearraw         
##  Min.   :        0   Length:21560      
##  1st Qu.:    39125   Class :character  
##  Median :   155219   Mode  :character  
##  Mean   :  1277105                     
##  3rd Qu.:   608144                     
##  Max.   :871922828                     
##  NA's   :4

This set has 18 columns consisting of 17 ‘character’ variables and one ‘int’ variable. The various columns contain information like park name, region, state, type of park, and number of visitors. Each observation (or row) is a different park by year.

The Problem

After seeing the structure of this set, I think the first question I would like to answer is which parks I should plan to visit based solely on popularity by number of visitors.

Exploratory Analysis

First, Let’s clean some things up by removing unnecessary columns for my analysis.

  ## Select fewer columns
data_clean <- select(data, region:yearraw)

 ## Check for missing or problem values
year_list <-data_clean %>% count(data_clean$yearraw)
head(year_list,4)
tail(year_list,2)
sum(is.na(data_clean$visitors))
## [1] 4

After looking at the smaller subset of data, the ‘yearraw’ column has a ‘Total’ section that will cause issues during analysis. I will also convert it to a numeric value rather than character value. In addition, the ‘visitors’ column has 4 values that are not available. I will remove those observations to streamline my exploration.

  ## Removing observations with NA's
data_clean <- na.omit(data_clean)


  ## Subset the data
data_totals <- subset(data_clean, yearraw == "Total")
data_clean <- subset(data_clean, yearraw != "Total")
data_clean$yearraw <- as.numeric(data_clean$yearraw)

Distance Analysis

I will start by loading and converting my hometown coordinates to the ‘usmap’ coordinate system.

  ## Coordinates for Helena, Montana
helena_coords <- data.frame(Longitude = -112.0372, Latitude = 46.5891)
helena_coords_transformed <- usmap_transform(helena_coords, input_names = c("Longitude", "Latitude"))

Now I need to use the Haversine formula which is useful for determining the shortest path between two points on the Earth using the Latitude and Longitude coordinate system.

  ## Calculate distances from Helena to each park
transformed_top_parks$distance <- distHaversine(matrix(c(helena_coords$Longitude, 
                                                helena_coords$Latitude), nrow = 1),
                                                matrix(c(transformed_top_parks$Longitude,
                                                transformed_top_parks$Latitude),
                                                ncol = 2), r=6378137)
  ## Converting to miles from Km
transformed_top_parks$distance_miles <- transformed_top_parks$distance / 1609.34

In order to plot this information I want to add my hometown coordinates to the top parks data frame.

 ## that repeats Helena's coordinates
helena_repeated <- data.frame(
  x = rep(helena_coords_transformed$x, nrow(transformed_top_parks)),
  y = rep(helena_coords_transformed$y, nrow(transformed_top_parks))
)

 ## Add Helena's coordinates to the 'transformed_top_parks' data frame
transformed_top_parks$helena_x <- helena_repeated$x
transformed_top_parks$helena_y <- helena_repeated$y

And finally, I will create the plot.

  ## Creating A second Data Frame for future plot
top_parks_cluster <- transformed_top_parks

  ## Filter to only the top parks by visitation
transformed_top_parks <- arrange(transformed_top_parks,desc(Total)) %>%
  filter(transformed_top_parks$Total >= 60000000)
## Now plot using the 'transformed_top_parks' dataframe
p6 <- base_map +
  geom_segment(data = transformed_top_parks, 
               aes(x = helena_x, y = helena_y, xend = x, 
               yend = y, color = distance_miles), 
               arrow = arrow(length = unit(0.1, "inches"))) +
  scale_color_viridis_b(name = "Distance (miles)") +
  coord_quickmap() +
  labs(title = "Distance from Helena, MT", x = NULL, y = NULL) +
    theme(legend.position = "right",
        text = element_text(size = 12), 
        title = element_text(size = 14),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.ticks = element_blank(), 
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        panel.background = element_rect(fill = "white", color = "white"))

p6

Defining the problem

After seeing this visualization, I want to pivot to a new problem. How I can optimize seeing multiple parks per trip rather than trying to choose one park. There are to0 many nearby! I need to use a clustering model to group the parks by looking at their distances from each other. To do this I need to use a hierarchical cluster function.

  ## Perform a hierarchical clustering
  
  ## Creating distance matrix
dist_matrix <- dist(top_parks_cluster[,c('Longitude', 'Latitude')])
clusters <- hclust(dist_matrix)

  ## Split into clusters of parks
park_clusters <- cutree(clusters, k = 5)
top_parks_cluster$cluster <- park_clusters

With my new set of clustered parks, I want to see how the clusters worked out.

p7 <- base_map +
  geom_point(data = top_parks_cluster, aes(x = x, y = y, color = factor(cluster)), size = 2, alpha = 0.9) +
  scale_color_brewer(palette = "Set1") +  # Use a qualitative color palette
  labs(title = "Cluster Analysis of National Parks", 
       color = "Cluster") +
  theme_minimal() +
  coord_quickmap() +
    labs(title = "Cluster Results", x = NULL, y = NULL) +
    theme(legend.position = "right",
        text = element_text(size = 12), 
        title = element_text(size = 14),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.ticks = element_blank(), 
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        panel.background = element_rect(fill = "white", color = "white"))

p7

Not a bad result, but there are a few parks I want to change to a different cluster.

top_parks_cluster$cluster[12] <- 1
top_parks_cluster$cluster[33] <- 1
top_parks_cluster <- top_parks_cluster[-21, ]
top_parks_cluster <- top_parks_cluster[-6, ]
top_parks_cluster <- top_parks_cluster[-3, ]
p7.1 <- base_map +
  geom_point(data = top_parks_cluster, aes(x = x, y = y, color = factor(cluster)), size = 2, alpha = 0.9) +
  scale_color_brewer(palette = "Set1") +  # Use a qualitative color palette
  labs(title = "Cluster Analysis of National Parks", 
       color = "Cluster") +
  theme_minimal() +
  labs(title = "Adjusted Cluster Results", x = NULL, y = NULL) +
    theme(legend.position = "right",
        text = element_text(size = 12), 
        title = element_text(size = 14),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.ticks = element_blank(), 
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        panel.background = element_rect(fill = "white", color = "white"))

p7.1

Better! Now I want to know what the most efficient way to drive from park to park is so I will use the Traveling Salesman Problem (TSP). The TSP finds shortest possible route that visits each city exactly once and returns to the original city. I would likely be able to get a more accurate route optimization if I was using road directions, rather than straight distances. I think that a basic TSP is more than enough for this situation.

  ## Initialize an empty list to store results
tsp_results <- list()

  ## Loop through each cluster
for (i in 1:max(top_parks_cluster$cluster)) {
  ## Subset parks for the current cluster
  cluster_parks <- top_parks_cluster[top_parks_cluster$cluster == i, ]
  
  ## Calculate the distance matrix for the current cluster
  dist_matrix <- as.dist(distm(cluster_parks[, c("Longitude", "Latitude")], 
                               cluster_parks[, c("Longitude", "Latitude")], 
                               fun = distHaversine))
  
  ## Solve the TSP problem using the nearest neighbor heuristic
  tsp_solution <- solve_TSP(TSP(dist_matrix), method = "nearest_insertion")
  
  ## Solution List
  tsp_results[[paste("Cluster", i)]] <- tsp_solution
}

Now that I have the results, I can visualize what the optimal route is. In order to do that, I want to plot them all on the same map which will require creating a function to compile all the data points into one table.

  ## Function to create segments from the TSP solution
get_tsp_segments <- function(cluster_number, top_parks_cluster, tsp_results) {
  
  ## Filter the parks by the selected cluster
selected_cluster <- filter(top_parks_cluster, cluster == cluster_number)

  ## Retrieve the TSP order for the cluster
tsp_order <- tsp_results[[paste("Cluster", cluster_number)]]
tsp_solution_order <- TOUR(tsp_order)

  ## Order the parks according to the TSP solution
ordered_cluster <- selected_cluster[tsp_solution_order, ]

  ## Close the loop by returning to the starting point
ordered_cluster <- rbind(ordered_cluster, ordered_cluster[1, ])

  ## Create segments for each leg of the TSP path
segments <- data.frame(cluster = cluster_number,
                       x = ordered_cluster$x[-nrow(ordered_cluster)], 
                       y = ordered_cluster$y[-nrow(ordered_cluster)], 
                       xend = ordered_cluster$x[-1], 
                       yend = ordered_cluster$y[-1]
                       )
  
  return(segments)
}

  ## Call the function
all_segments <- lapply(1:max(top_parks_cluster$cluster), function(cluster_num) {
  get_tsp_segments(cluster_num, top_parks_cluster, tsp_results)
})

  ## Combine all segments into one data frame
all_segments_df <- do.call(rbind, all_segments)
  ## Creating plot 8
p8 <- base_map +
  geom_segment(data = all_segments_df, 
               aes(x = x, y = y, xend = xend, yend = yend, 
               group = cluster, color = as.factor(cluster),
               ), 
               linewidth = unit(0.6, "cm"), 
               arrow = arrow(length = unit(0.2, "cm"))) +
  scale_color_brewer(palette = "Set1", name = "Routes") +
  geom_point(data = all_segments_df, 
             aes(x = x, y = y, group = cluster), 
             size = .25) +
    labs(title = "Multi-park Trip Options", x = NULL, y = NULL) +
    theme(legend.position = "right",
          legend.background = element_rect(fill = "white", 
                                           color = "white"),
          legend.box.background = element_rect(fill = "white"),
          legend.key = element_rect(fill = "white", colour = "white"),
          text = element_text(size = 12), 
          title = element_text(size = 14),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          axis.ticks = element_blank(), 
          axis.text.x = element_blank(),
          axis.text.y = element_blank(),
          panel.background = element_rect(fill = "white"))

p8